Intro to research toolbox

Jeremy Lew

May 24, 2024

Topics

  1. Introduction & learning plan
  2. Getting comfortable with R
  3. Programming fundamentals
  4. Data cleaning
  5. Data visualisation
  6. Statistical analysis
  7. Presentation of results

Introduction & learning plan

Motivation

  1. Added tools to your toolbox
  2. Transferrable skill
  3. Learning how to program in 1 language helps you learn other languages
  4. Leverage on open source tools/projects written by others

Learning plan

  1. The topics in this deck mirror the workflow of a project

flowchart LR
  D[Data exported\nfrom REDCAP]-->C[Data cleaning]
  C-->S[Statistical analysis]
  S-->P[Presentation\nof results]

  1. The game plan is to first do a quick pass through from data cleaning to analysis:
    1. Take a toy dataset
    2. “Clean” the data (minimally)
    3. Run a regression model
  2. Once we have gotten comfortable with R, we will explore each topic in greater detail
    1. Data cleaning
    2. Programming fundamentals
    3. Analysis: generalised linear models
    4. Presentation of results

Resources

  1. R courses
  2. The Epidemiologist R Handbook
  3. R for Data Science
  4. Advanced R
  5. Cheat Sheets

Getting comfortable with R

What is an R package?

  1. It is a collection of code/data written by someone and packaged for distribution
  2. Hosted on the Comprehensive R Archive Network and/or Github for public download
  3. From the CRAN webpage, we are able to find a reference manual documenting the data/functions of the package
  4. From the Github repository, we are able to see the actual codes implemented by the package’s author(s)
  5. To install a package we would run install.packages("palmerpenguins") in the console
  6. To load a package, we would run library(palmerpenguins)
  7. To uninstall a package we would run remove.packages("palmerpenguins") in the console

[DEMO] Intro to RStudio workspace

  1. Console: where you run/execute lines of code

  2. To assign a variable, we use the <- operator

    x <- 1:10; print(x)
     [1]  1  2  3  4  5  6  7  8  9 10
    1. random-access memory is allocated when a variable is assigned/declared
    2. random-access memory is freed-up when the variable is deleted (or garbage collected if it goes out of scope)
  3. Environment: where you’re able to see variables stored in random-access memory

    1. To clear variables from environment \(\rightarrow\) execute rm(list = ls()) in console

[DEMO] Intro to RStudio workspace

  1. Script
    1. Writing code and saving the script
    2. Execute single/multiple lines of code \(\rightarrow\) highlight code + Ctrl-enter
    3. Runs from top to bottom
    4. Comments will not be executed
  2. Hotkeys
    1. Clear the console \(\rightarrow\) Ctrl-l
    2. Clear a line in console \(\rightarrow\) Ctrl-d

[PRACTISE] Tasks to do

  1. For this task, you will need the palmerpenguins and tidyverse packages. Install them if you have not already done so
  2. We will make use of the penguins dataset from palmerpenguins package
  3. Understand the data
  4. Dichotomise body_mass_g (continuous) into a categorical variable with 2 categories (“light”: <= 4750g and “heavy”: > 4750g)
  5. Run a linear regression1 with body_mass_g as the dependent variable and independent variables: species, bill_length_mm, bill_depth_mm, flipper_length_mm and sex
  6. Run a logistic regression with the categorical body mass variable you created earlier as the dependent variable, with the same independent variables as before

Programming Fundamentals

Data types

  1. Character data type e.g. "a"

    typeof("a")
    [1] "character"
  2. Integer data type e.g. 1L

    typeof(1L)
    [1] "integer"
  3. Numeric data type e.g. 5.0

    typeof(5.0)
    [1] "double"
  4. Logical data type e.g. TRUE or FALSE

    typeof(TRUE)
    [1] "logical"
  5. Complex data type e.g. 1 + 4i

    typeof(1 + 4i)
    [1] "complex"

Data types (continued)

  1. Factor datatype for working with categorical variables

    x <- factor(sample(c("Malay", "Others", "Indian", "Chinese"), 10, replace = T),
                levels = c("Chinese", "Malay", "Indian", "Others"))
    print(x)
     [1] Indian  Chinese Malay   Malay   Chinese Chinese Chinese Malay   Malay  
    [10] Others 
    Levels: Chinese Malay Indian Others
    print(levels(x))
    [1] "Chinese" "Malay"   "Indian"  "Others" 
  2. A factor variable is more useful than a plain character vector. E.g. the first level will be used by glm as the reference category

Data structures (vectors)

  1. Data structures are containers for holding data

  2. A character vector holds multiple characters

    letters
     [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
    [20] "t" "u" "v" "w" "x" "y" "z"
    typeof(letters)
    [1] "character"
  3. A numeric vector holds multiple numbers

    x <- 1:5; print(x)
    [1] 1 2 3 4 5
    typeof(x)
    [1] "integer"

Exercise

  1. Can you query the length of your character vector? length(letters)
  2. Can you slice the character vector to get the first 5 elements? letters[1:5]
  3. Can you slice the character vector to get the 1st, 9th, 20th elements? letters[c(1, 9, 20)]
  4. Can you flip the character vector from last to first element? letters[length(letters):1]

Data structures (lists)

  1. Vectors can only hold 1 kind of data type e.g. characters, numeric etc
  2. Lists are “containers” that can hold multiple data types
    1. Unnamed lists

      list("a", 1, 10:15)
      [[1]]
      [1] "a"
      
      [[2]]
      [1] 1
      
      [[3]]
      [1] 10 11 12 13 14 15
    2. Named lists

      list(a = "a", b = 1, c = 10:15)
      $a
      [1] "a"
      
      $b
      [1] 1
      
      $c
      [1] 10 11 12 13 14 15

Data structures (lists, continued)

Exercise

  1. Create a variable in your console: mylist <- list(a = "a", b = 1, c = 10:15)
  2. Can you query the length of mylist? length(mylist)
  3. Can you get the second element of your mylist using
    1. index: mylist[1]
    2. key: mylist$a or mylist[["a"]]
  4. Compare the difference between mylist[3] and mylist[[3]]
  5. Compare the difference between class(mylist[3]) and class(mylist[[3]])

Data structures (dataframe)

  1. Dataframe is a tabular container that can hold multiple data types like lists, but each column can only store data of the same data type

  2. To view the first 2 rows of the penguins dataset from the palmerpenguins package

    library(tidyverse)
    library(palmerpenguins)
    penguins %>% head(2)
    # A tibble: 2 × 8
      species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
      <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
    1 Adelie  Torgersen           39.1          18.7               181        3750
    2 Adelie  Torgersen           39.5          17.4               186        3800
    # ℹ 2 more variables: sex <fct>, year <int>
  3. To view the last 2 rows of the dataset

    penguins %>% tail(2)
    # A tibble: 2 × 8
      species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
      <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
    1 Chinstrap Dream            50.8          19                 210        4100
    2 Chinstrap Dream            50.2          18.7               198        3775
    # ℹ 2 more variables: sex <fct>, year <int>
  4. To view the data in rstudio, execute view(penguins)

Data structures (dataframe, continued)

Exercise

  1. Can you query the dimensions of the penguins dataset using dim(penguins), ncol(penguins), nrow(penguins)?
  2. Can you get a glimpse of the data using the functions glimpse(penguins)?
  3. Can you view summary statistics of the data using summary(penguins)?

Data structures (dataframe, accessing variables)

  1. To get the column/variable-names of your dataset

    colnames(penguins)
    [1] "species"           "island"            "bill_length_mm"   
    [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
    [7] "sex"               "year"             
  2. To access any column variable, we use the $ syntax

    penguins$species[1:10]
     [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
    Levels: Adelie Chinstrap Gentoo
  3. To get a table count of a variable or a cross-table count of 2 variables

    table(penguins$sex, useNA = "ifany")
    
    female   male   <NA> 
       165    168     11 
  4. See Data Cleaning for more ways to work with dataframes

Data structures (dataframe, loading data)

  1. To import data from csv file, use read_csv
  2. To import data from an excel file, use read_excel
  3. To import data from SAS/SPSS/Stata or export from R to these file formats, use the haven package

Data structures (dataframe, adding labels)

  1. To add labels to a dataframe, we make use of the labelled package

Functions

  1. A good chapter on explaining functions is Chapter 6, Advanced R, where most of the examples in the slides are taken

  2. Example function

    f01 <- function(x, y) {
      # A comment
      x + y
    }
  3. Components of a function

    1. Arguments \(\rightarrow\) inputs to the function (there can also be functions with no arguments)
    2. Body \(\rightarrow\) the code inside the function
    3. Environment \(\rightarrow\) a list-like structure that stores/maps the variable name to value (a namespace as some people call it)
  4. Functions are objects that can be assigned or they can be anonymous functions

Function invocation

  1. How do we invoke the function?

    print(f01)
    function(x, y) {
      # A comment
      x + y
    }
    f01(3, 5)
    [1] 8
  2. How do we invoke one function after another?

    1. Suppose we have another function f02

      f02 <- function(a, b) {
        a * 10 + b
      }
    2. To call f01 first followed by f02 and sqrt we will “nest” the functions (inside-out, right to left)

      sqrt(f02(f01(1, 2), 5))
      [1] 5.91608

Function invocation (continued)

Another way to invoke a series of functions is to use the pipe operator %>% provided by the magrittr package

library(magrittr)
value <- 1
value %>%
  f01(2) %>%
  f02(5) %>%
  sqrt()
[1] 5.91608

Exercise

  1. What is the difference between f02 and f02(1, 2)?

Function output

  1. What is returned by a function? The last evaluated expression

  2. How do we store values outputted from a function? With assignment statment

  3. Early termination of a function with a return statement

Functions - argument passing

  1. Pass by copy vs pass by reference

  2. In R, all functions are pass by reference if you don’t modify the object subsequently within the function. i.e. copy-on-modify

  3. How to pass arguments?

    f04 <- function(x, y, z) {
      (x / y) * z
    }
    1. By position e.g. f04(10, 50, 20)

    2. By name e.g. f04(z = 20, y = 50, x = 10)

    3. Unpacking multiple named arguments using do.call

      do.call(f04, list(x = 10, y = 50, z = 20))
      [1] 4

Functions - lexical scoping, name masking

Names defined inside a function mask names defined outside a function

x <- 10
y <- 20
f05 <- function() {
  x <- 1
  y <- 2
  c(x, y)
}

f05()
[1] 1 2

Functions - lexical scoping, looking one level up

R will look up a variable in the enclosing scope (e.g. within the function) and if it’s not found, will continue to proceed upwards (e.g. enclosing function or global environment) to look for the variable until it is found

x <- 1
f06 <- function() {
  y <- 2
  i <- function() {
    z <- 3
    c(x, y, z)
  }
  i()
}

f06()
[1] 1 2 3

Exercise

Using the sample function, change the z variable inside the function and observe that…

z <- 10
f07 <- function(x, y) {
  # make your edit here
  z <- z * 100
  z + x + y
}

…there is no change to the z variable outside the function in the global environment

Functions - lexical scoping, dynamic lookup

R looks for the values when the function is run, not when the function is created

z <- 10
f08 <- function(x, y) {
  z + x + y
}

z <- 100
f08(5, 10)
[1] 115

Functions

Loops

Debugging

Memory Management

Data Cleaning

Common functions - packages

  1. These functions are from various packages, conveniently collected in the tidyverse package
  2. The best package for data wrangling is dplyr. Useful info can be found in this chapter
  3. A package for working with dates is lubridate
  4. A package for working with strings is stringr. This is useful for cleaning free-text response data
  5. A package for working with factor datatype is forcats

Common functions - dplyr::select

The select function enables you to select a subset of the columns in your dataset


# library(tidyverse) # if not already loaded
# library(penguins) # if not already loaded
penguins %>%
  select(species, island, bill_length_mm) %>%
  head(3)
# A tibble: 3 × 3
  species island    bill_length_mm
  <fct>   <fct>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3


# there is also ends_with
penguins %>%
  select(starts_with("bill")) %>%
  head(3)
# A tibble: 3 × 2
  bill_length_mm bill_depth_mm
           <dbl>         <dbl>
1           39.1          18.7
2           39.5          17.4
3           40.3          18  


penguins %>%
  select(matches("mm")) %>%
  head(3)
# A tibble: 3 × 3
  bill_length_mm bill_depth_mm flipper_length_mm
           <dbl>         <dbl>             <int>
1           39.1          18.7               181
2           39.5          17.4               186
3           40.3          18                 195

Common functions - dplyr::filter

The filter function enables you to select a subset of the rows that meet a certain criteria

penguins %>%
  filter(body_mass_g > 3500) %>%
  head(3)
# A tibble: 3 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
penguins %>% filter(sex == "female")
# A tibble: 165 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           40.3          18                 195        3250
 3 Adelie  Torgersen           36.7          19.3               193        3450
 4 Adelie  Torgersen           38.9          17.8               181        3625
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           36.6          17.8               185        3700
 7 Adelie  Torgersen           38.7          19                 195        3450
 8 Adelie  Torgersen           34.4          18.4               184        3325
 9 Adelie  Biscoe              37.8          18.3               174        3400
10 Adelie  Biscoe              35.9          19.2               189        3800
# ℹ 155 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Common functions - dplyr::mutate

The mutate function enables you to add columns to your dataset. The added columns can be derived from existing column(s)

penguins %>%
  mutate(body_mass_100g = body_mass_g / 100) %>%
  select(body_mass_g, body_mass_100g) %>%
  head(5)
# A tibble: 5 × 2
  body_mass_g body_mass_100g
        <int>          <dbl>
1        3750           37.5
2        3800           38  
3        3250           32.5
4          NA           NA  
5        3450           34.5
penguins %>%
  mutate(bill_length_plus_depth_mm = bill_length_mm + bill_depth_mm) %>%
  select(matches("bill")) %>%
  head(5)
# A tibble: 5 × 3
  bill_length_mm bill_depth_mm bill_length_plus_depth_mm
           <dbl>         <dbl>                     <dbl>
1           39.1          18.7                      57.8
2           39.5          17.4                      56.9
3           40.3          18                        58.3
4           NA            NA                        NA  
5           36.7          19.3                      56  

Common functions - dplyr::group_by, summarise

  1. The summarise function enables you to get summary statistics like N, mean, median etc

    penguins %>% summarise(N = n(),
                           mean_body_mass_g = mean(body_mass_g))
    # A tibble: 1 × 2
          N mean_body_mass_g
      <int>            <dbl>
    1   344               NA
  2. The group_by function enables you to get summary statistics within groups

    penguins %>%
      group_by(sex) %>%
      summarise(N = n(),
                mean_body_mass_g = mean(body_mass_g),
                median_body_mass_g = median(body_mass_g))
    # A tibble: 3 × 4
      sex        N mean_body_mass_g median_body_mass_g
      <fct>  <int>            <dbl>              <dbl>
    1 female   165            3862.               3650
    2 male     168            4546.               4300
    3 <NA>      11              NA                  NA

Common functions - dplyr::arrange

The function arrange enables us to sort by a certain variable

penguins %>%
  group_by(sex, year) %>%
  summarise(mean_body_mass_g = mean(body_mass_g)) %>%
  arrange(desc(year))
# A tibble: 9 × 3
# Groups:   sex [3]
  sex     year mean_body_mass_g
  <fct>  <int>            <dbl>
1 female  2009            3874.
2 male    2009            4521.
3 <NA>    2009              NA 
4 female  2008            3888.
5 male    2008            4632.
6 <NA>    2008            4650 
7 female  2007            3821.
8 male    2007            4479.
9 <NA>    2007              NA 

Common functions - dplyr::if_else, case_when

The functions if_else and case_when are often used with mutate to create new variables

set.seed(2)
penguins %>%
  mutate(bill_length_type = if_else(bill_length_mm >= 48.5, "long bill", "short bill")) %>%
  sample_n(4)
# A tibble: 4 × 9
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
1 Chinstrap Dream            43.5          18.1               202        3400
2 Gentoo    Biscoe           43.6          13.9               217        4900
3 Gentoo    Biscoe           48.1          15.1               209        5500
4 Gentoo    Biscoe           46.8          14.3               215        4850
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_type <chr>
penguins %>%
  mutate(bill_length_type = case_when(bill_length_mm >= 48.5 ~ "long bill",
                                      bill_length_mm < 48.5 ~ "short bill",
                                      TRUE ~ NA_character_)) %>%
  select(bill_length_type) %>%
  table(useNA = "ifany")
bill_length_type
 long bill short bill       <NA> 
        87        255          2 

Common functions - forcats::fct_relevel

Common functions - forcats::fct_collapse

Common functions - stringr::str_detect

Common functions - dplyr::pivot_wider & longer

Common functions - dplyr::bind_rows, bind_cols

Common functions - dplyr::mutate + across

Data visualisation

The ggplot2 package is well known for plotting figures

Histograms

penguins %>%
  ggplot(aes(x = flipper_length_mm, fill = species)) + geom_histogram(bins = 40)

Facet wrap

penguins %>%
  ggplot(aes(x = flipper_length_mm, fill = sex)) +
    geom_histogram() +
    facet_wrap(~species, dir = "v", scale = "free_y")

Analysis: Descriptive statistics

Analysis: Bivariate tests for independent observations

Variable 1 Variable 2 Bivariate test
Categorical Categorical 1. Chi-square test
2. Fisher’s exact test
Categorical Continuous Parametric:
1. Independent t-test (2 categories)
2. One-way independent ANOVA (>2 categories)
Categorical Continuous Non-parametric:
1. Mann-Whitney test (2 categories) a.k.a. Wilcoxon rank-sum test
2. Kruskal-Wallis test (>2 categories)
Continuous Continuous Parametric: Pearson’s correlation coefficient
Continuous Continuous Non-parametric: Spearman’s correlation coefficient

Analysis: Bivariate tests for repeated measurements

Variable 1 Variable 2 Bivariate test
Categorical Categorical time
(e.g. baseline, 12-month)
McNemar’s test
Continuous Categorical time
(e.g. baseline, 12-month)
Parametric: Dependent t-test
Continuous Categorical time
(e.g. baseline, 12-month)
Non-parametric: Wilcoxon signed-rank test

Measures of central tendency

  1. Mean (SD), median (IQR) helps you get a sense of the distribution of characteristics in your study population with respect to levels of the outcome
  2. The tableby function of the arsenal package lets you do this easily
  3. We will talk about improving upon the formatting of the table in presenting tables
library(arsenal)

1tableby(species ~ sex + island + bill_length_mm,
        data = penguins,
        control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"))) %>%
summary(text = TRUE) %>%
knitr::kable(align = c("l", rep("r", 5)))
1
tableby function invoked by these 3 lines
Adelie (N=152) Chinstrap (N=68) Gentoo (N=124) Total (N=344) p value
sex 0.976
- N-Miss 6 0 5 11
- female 73 (50.0%) 34 (50.0%) 58 (48.7%) 165 (49.5%)
- male 73 (50.0%) 34 (50.0%) 61 (51.3%) 168 (50.5%)
island < 0.001
- Biscoe 44 (28.9%) 0 (0.0%) 124 (100.0%) 168 (48.8%)
- Dream 56 (36.8%) 68 (100.0%) 0 (0.0%) 124 (36.0%)
- Torgersen 52 (34.2%) 0 (0.0%) 0 (0.0%) 52 (15.1%)
bill_length_mm < 0.001
- N-Miss 1 0 1 2
- Mean (SD) 38.791 (2.663) 48.834 (3.339) 47.505 (3.082) 43.922 (5.460)
- Median (Q1, Q3) 38.800 (36.750, 40.750) 49.550 (46.350, 51.075) 47.300 (45.300, 49.550) 44.450 (39.225, 48.500)
- Range 32.100 - 46.000 40.900 - 58.000 40.900 - 59.600 32.100 - 59.600

Measures of central tendency (repeated measures)

  1. The paired function from arsenal package

Analysis: Generalised Linear Models (GLM)

How are categorical variables represented?

  1. Categorical variables are characters, but we need a numeric matrix in order to perform computations to get the \(\beta\) coefficients
set.seed(10)
idx <- sample(1:344, 5)
penguins[idx,]
# A tibble: 5 × 8
  species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie    Dream               35.6          17.5               191        3175
2 Chinstrap Dream               50.7          19.7               203        4050
3 Adelie    Torgersen           39.7          18.4               190        3900
4 Gentoo    Biscoe              43.2          14.5               208        4450
5 Gentoo    Biscoe              47.2          13.7               214        4925
# ℹ 2 more variables: sex <fct>, year <int>
  1. So one-hot encoding / dummy encoding will be applied to categorical variables
model.matrix(~species, penguins)[idx,]
    (Intercept) speciesChinstrap speciesGentoo
137           1                0             0
330           1                1             0
72            1                0             0
211           1                0             1
271           1                0             1

Framework

flowchart LR
  A[Model specification] --> B[Inference] --> C[Diagnostics]

  1. Model specification
    1. How to decide what type of model to fit?
    2. How to decide what variables to choose?
  2. Inference
    1. How do we get the value of the coefficients, confidence intervals, p-values?
  3. Diagnostics
    1. Does the model fit our data well?

Why the need for GLM?

  1. We want to find the association between an outcome (e.g. BMI) and some dependent variables (e.g. age, gender, ethnicity, social economic status)
  2. We are used to fitting a multivariable linear regression \[ \begin{align} \underbrace{BMI}_{\text{continuous}} &= \underbrace{\beta_0 + \beta_1age + \beta_2gender + \beta_3ethnicity + \cdots + \epsilon}_{\text{this linear combination is continuous}} \\ \epsilon &\sim N(0, \sigma^2) \end{align} \]
  3. What if your outcome is not continuous?
    1. E.g. outcome takes on values of 0 or 1 (logistic regression)
    2. E.g. outocme takes on discrete integer values 0, 1, 2, … (poisson regression)
  4. GLM gives you a way to relate the outome (which can take on various distributions) with a linear combination of dependent variables

Formal definition of GLM

GLM consists of 2 components. See notations below1

Component 1: What distribution does your outcome, \(y_i\) take on?

\[\begin{align} y_i|X_i \sim \text{member of exponential family} \ \text{(e.g. $y_i$ is normally distributed)} \end{align}\]

Component 2: What is the link function \(g\) you want to use?

\[\begin{align} g(\mathbb{E}(y_i|X_i)) = X_i^{T} \beta = \underbrace{\beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots}_{\text{this linear combination is continuous}} \end{align}\]
  1. \(g\) is the link function that maps \(\mathbb{E}(y_i|x_i)\) to \(X_i^{T}\beta\)
  2. It is the link function that enables the mapping of the continuous \(X_i^{T}\beta\) to \(y_i\), which can be discrete for example

Formal definition of GLM

  1. In a generalised linear model, the outcome variable \(y_1,...,y_n\) are modelled as independent and identically distributed (iid) observations
  2. The outcome variable is treated as a random variable (i.e. outcome takes on a certain distribution e.g. normal), but NOT the independent variables

Family member 1: Linear regression

  1. When our outcome, \(y_i\) is continuous and takes on real values (i.e. \(y_i \in \mathbb{R}\)), we may choose to model \(y_i\) with a normal distribution
  2. Component 1: \(y_i|x_i\) is normally distributed
\[\begin{align*} y_i | x_i \sim \mathcal{N}(X_i^{T}\beta, \sigma^2) \end{align*}\]
  1. Component 2: link function is the identity function i.e. no transformation done
\[\begin{align*} \mathbb{E}(y_i|x_i) = X_i^{T}\beta = \beta_0 + \beta_1age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots \hspace{2mm} \end{align*}\]

Running a linear regression in R

  1. Notice the use of the formula syntax. LHS of the ~ is the dependent variable, while RHS are the independent variables
  2. Notice the family argument contains the 2 components of the GLM we discussed earlier
model <- 
  glm(body_mass_g ~ species + bill_length_mm +
                    bill_depth_mm + 
                    flipper_length_mm + sex,
      data = penguins,
      family = gaussian(link = "identity"))

summary(model)

Call:
glm(formula = body_mass_g ~ species + bill_length_mm + bill_depth_mm + 
    flipper_length_mm + sex, family = gaussian(link = "identity"), 
    data = penguins)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1460.995    571.308  -2.557 0.011002 *  
speciesChinstrap   -251.477     81.079  -3.102 0.002093 ** 
speciesGentoo      1014.627    129.561   7.831 6.85e-14 ***
bill_length_mm       18.204      7.106   2.562 0.010864 *  
bill_depth_mm        67.218     19.742   3.405 0.000745 ***
flipper_length_mm    15.950      2.910   5.482 8.44e-08 ***
sexmale             389.892     47.848   8.148 7.97e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 82563.33)

    Null deviance: 215259666  on 332  degrees of freedom
Residual deviance:  26915647  on 326  degrees of freedom
  (11 observations deleted due to missingness)
AIC: 4723.9

Number of Fisher Scoring iterations: 2

Family member 2: Logistic regression

  1. When our outcome, \(y_i\), is discrete and dichotomous (i.e. take two discrete values, 0-1, success-failure), we can model \(y_i\) with a Bernoulli distribution
  2. Component 1: \(y_i|x_i\) is Bernoulli distributed
    1. \(y_i | x_i \sim \mathrm{Bernoulli}(\pi_i)\)
    2. \(\pi_i\) is the probability of success of \(y_i\)
    3. \(\mathbb{E}(y_i|x_i) = \pi_i\)
  3. Component 2: Link function, \(g\), is the logit function
\[\begin{align*} g(\mathbb{E}(y_i|x_i)) &= \underbrace{\mathrm{ln} \left( \frac{\mathbb{E}(y_i|x_i)}{1-\mathbb{E}(y_i|x_i)} \right)}_{logit[\mathbb{E}(y_i|x_i)]} = X_i^{T}\beta = \underbrace{\beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots}_{\text{this linear combination is continuous}} \end{align*}\]

Running a logistic regression in R

  1. To be filled only after completion of exercise
  2. Code is largely similar to linear regression except for:
    1. family = binomial(link = "logit")

Inference

  1. Maximum likelihood estimation is the way to derive the beta coefficients

Interpretation: What do the coefficients mean?

Family member 3: Conway-Maxell-Poisson (CMP) regression

  1. When our outcome variable is a “count”, the poisson regression model is used because the poisson distribution models counts better than say linear regression which assumes that the outcome variable is normally distributed
  2. However, the poisson regression model is restrictive as it assumes equi-dispersion in our outcome variable (i.e. mean of outcome variable approximately equals the variance of the outcome variable) When there is under- or over-dispersion, the poisson distribution no longer models the outcome variable well.
  3. When there is over-dispersion (variance of outcome > mean of outcome), negative binomial regression is often used
  4. When there is under-dispersion (i.e. variance of outcome < mean of outcome), we will use the Conway-Maxwell-Poisson regression model
  5. Useful resources: (i) Video explanations see1,2 (ii) Sellers & Premeaux3 explains the CMP regression and surveys available statistical packages

CMP distribution

  1. The Conway-Maxwell-Poisson distribution differs from the Poisson distribution with the introduction of a parameter \(\nu\)
  2. The \(\nu\) parameter enables the Conway-Maxwell-Poisson distribution to model a wide range of dispersion, and generalises well-known distributions
    1. When \(\nu = 1\), it takes on the Poisson distribution (equi-dispersion)
    2. When \(\nu = 0\) and \(\lambda < 1\), it takes on the geometric distribution (over-dispersion)
    3. When \(\nu \rightarrow \infty\), it takes on the bernoulli distribution (under-dispersion)

CMP regression

  1. Use of the Conway-Maxwell-Poisson (CMP) distribution in the generalised linear model (GLM) setting:
    1. In the original CMP formulation of Sellers and Shmueli4, the relationship between dependent variable \(Y\) and independent variables \(X\) is given by the equations:
      1. \(ln(\lambda_{i}) = X_{i}^T\beta\)
      2. \(\mathbb{E}(Y_{i}) = \lambda_{i} \frac{\partial ln Z(lambda_{i}, \nu)}{\partial \lambda_i} \approx \lambda_{i}^{\frac{1}{\nu}} - \frac{\nu - 1}{2\nu}\), where \(Z(\lambda_i, \nu_i)\) is a normalizing constant
  2. However, the CMP regression model is not amenable to interpretation of coefficients by mean contrasts, because \(\lambda_i\) is related to \(\mathbb{E}(Y_i)\) by a non-linear function
  3. To overcome this, Huang 20175 reparameterised the CMP distribution to model the mean directly. He called it CMP\(\mu\)

CMP\(\mu\)-regression

  1. Under the CMP\(\mu\)-regression formulation, the relationship between dependent variable \(Y\) and independent variables \(X\) is given by \(ln(\mathbb{E}(Y_i|X)) = X_{i}^T\beta\)
  2. The convenient thing about CMP-regression and CMP\(\mu\)-regression is that we can conduct a likelihood ratio test4,5 to see if a poisson regression model (poisson regression is a special case of CMP-regression when \(\nu = 1\)) is adequate (i.e. \(H_0: \nu = 1\) vs \(H_1: \nu \ne 1\))
  3. Note that \(H_1\) does not specify the direction (under vs over) of data dispersion. This can be assessed via the dispersion estimate \(\hat{\nu}\)4 (i.e. \(\hat{\nu} > 1\) if under-dispersion and \(\hat{\nu} < 1\) if over-dispersion)
  4. Model diagnostics5: If the underlying distributional model is correct, the probability inverse transformation (PIT) should resemble a random sample from a standard uniform distribution. Goodness-of-fit can then be assessed by plotting a histogram of the PIT, or via a quantile plot of the PIT against the uniform distribution

Running a CMP\(\mu\)-regression in R

We will use the mpcmp6 R package’s implementation of the CMP\(\mu\)-regression

Analysis: Concordance

What is the Kappa statistic?

  1. The Kappa statistic is a measure of “true” agreement. It indicates the proportion of agreement beyond that expected by chance7
  2. If prevalence index is high, chance agreement will be higher, kappa is reduced
  3. If bias index is high, chance agreement will be lower, kappa is higher

Calculation of Kappa statistic

  1. Calculation of prevalence-adjusted, bias-adjusted Kappa can be done through the epiR package

Analysis: Comparative effectiveness research

Background

  1. Confounding bias

Purpose of propensity score matching

  1. One of the methods to deal with confounding bias in retrospective observational study
  2. Compare the effectiveness of different treatment options8
  3. Compare the effectiveness of a health intervention programme vs usual care group9

How do we choose matching variables?

Analysis: Intervention effect estimation

Presentation of results

Markdown

  1. Markdown is a lightweight markup language that you can use to add formatting elements to plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the world’s most popular markup languages.1
  2. Try out the markdown live preview

R Markdown

  1. R Markdown is an extension to markdown that enables user to execute R code and display the code output in addition to text
  2. The knitr application enables us to execute R code and display output in a temporary markdown document
  3. Pandoc then converts the markdown document to the desired format (e.g. html, pdf, docx, pptx)
  4. See summary by Robin Linacre

Image from R Markdown cookbook chapter 2.1

[DEMO] Creation of an R markdown document

  1. Create an rmarkdown document

  2. Knit your document into html

Components of Rmarkdown document

  1. YAML header: controls the meta data e.g. Title, date, output document format, table of contents
  2. Formatted text of your descriptions, explanations etc.
  3. Code chunks: where you run R code e.g. plot graphs and display the output in the same document

Image taken from An Introduction to R

R markdown YAML header

---
title: "My project"
author: "CRU"
date: '`r format(Sys.Date(), "%d %b %Y")`'
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
bibliography: references.bib
csl: vancouver-superscript.csl 
link-citations: true
---
  1. This is a common setting which I use
  2. bookdown::html_document2 of the Bookdown package is a good output format because it takes care of formatting e.g. auto-numbering of sections, handling of references, appendix
  3. toc (line 7 & 8) refers to table of contents
  4. bibliography, csl and link-citations is for adding citations stored in a bibtex file into your document. More will be covered in a later slide

Consort diagram

  1. The consort package provides a convenient consort_plot function to plot CONSORT diagrams for reporting inclusion/exclusion/allocation (see the vignette)

    library(consort)
    flow_diagram <- consort_plot(data = penguins %>%
                                          mutate(id = row_number(),
                                                 exclusion = case_when(island == "Dream" ~ "Penguins on Dream island",
                                                                       year == 2007 ~ "Data collected in 2007",
                                                                       TRUE ~ NA_character_) %>%
                                                             fct_relevel("Penguins on Dream island",
                                                                         "Data collected in 2007")),
                                 orders = c(id = "Study population",
                                            exclusion = "Excluded",
                                            id = "Data for analysis"),
                                            side_box = "exclusion",
                                 cex = 1.2)
    plot(flow_diagram)

  2. For presenting in html documents, the plot function does not render the plot fully so when that happens, save it as an image first

    # Change the file path to where you want your image to be saved
    ggsave(here("images/consort_diagram.png"), plot = build_grid(flow_diagram))
  3. Next, load your image using knitr::include_graphics

    # You can use the fig.height or fig.width chunk option to rescale your image
    knitr::include_graphics(here("images/consort_diagram.png"))

Presenting tables

pacman::p_load(arsenal, knitr)

tableby(species ~ sex + island + bill_length_mm,
        data = penguins,
        control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"))) %>%
summary(text = TRUE) %>%
1knitr::kable(align = c("l", rep("r", 5)))
1
The kable function converts the table object to html
Adelie (N=152) Chinstrap (N=68) Gentoo (N=124) Total (N=344) p value
sex 0.976
- N-Miss 6 0 5 11
- female 73 (50.0%) 34 (50.0%) 58 (48.7%) 165 (49.5%)
- male 73 (50.0%) 34 (50.0%) 61 (51.3%) 168 (50.5%)
island < 0.001
- Biscoe 44 (28.9%) 0 (0.0%) 124 (100.0%) 168 (48.8%)
- Dream 56 (36.8%) 68 (100.0%) 0 (0.0%) 124 (36.0%)
- Torgersen 52 (34.2%) 0 (0.0%) 0 (0.0%) 52 (15.1%)
bill_length_mm < 0.001
- N-Miss 1 0 1 2
- Mean (SD) 38.791 (2.663) 48.834 (3.339) 47.505 (3.082) 43.922 (5.460)
- Median (Q1, Q3) 38.800 (36.750, 40.750) 49.550 (46.350, 51.075) 47.300 (45.300, 49.550) 44.450 (39.225, 48.500)
- Range 32.100 - 46.000 40.900 - 58.000 40.900 - 59.600 32.100 - 59.600

Formatting HTML tables

The kableExtra package provides useful functions to format tables e.g. changing column width, changing colors, font size, fixed header rows etc.

pacman::p_load(arsenal, knitr, kableExtra)

tableby(species ~ sex + island + bill_length_mm,
        data = penguins,
        control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"))) %>%
summary(text = TRUE) %>%
kable(align = c("l", rep("r", 5))) %>%
kableExtra::column_spec(column = 6, width_min = "2.5cm")
Adelie (N=152) Chinstrap (N=68) Gentoo (N=124) Total (N=344) p value
sex 0.976
- N-Miss 6 0 5 11
- female 73 (50.0%) 34 (50.0%) 58 (48.7%) 165 (49.5%)
- male 73 (50.0%) 34 (50.0%) 61 (51.3%) 168 (50.5%)
island < 0.001
- Biscoe 44 (28.9%) 0 (0.0%) 124 (100.0%) 168 (48.8%)
- Dream 56 (36.8%) 68 (100.0%) 0 (0.0%) 124 (36.0%)
- Torgersen 52 (34.2%) 0 (0.0%) 0 (0.0%) 52 (15.1%)
bill_length_mm < 0.001
- N-Miss 1 0 1 2
- Mean (SD) 38.791 (2.663) 48.834 (3.339) 47.505 (3.082) 43.922 (5.460)
- Median (Q1, Q3) 38.800 (36.750, 40.750) 49.550 (46.350, 51.075) 47.300 (45.300, 49.550) 44.450 (39.225, 48.500)
- Range 32.100 - 46.000 40.900 - 58.000 40.900 - 59.600 32.100 - 59.600

Cross-references & citations

  1. See cross-references and citations on how to add internal cross-references and citations from a bibtex file
  2. See bibliographies and citations on how to add a references and appendix section

References

1.
VideoLecturesChannel. A Flexible Model for Count Data: The COM-Poisson Distribution. 2012.
2.
Consortium R. Observation driven Conway-Maxwell Poisson count data models. 2018.
3.
Sellers KF, Premeaux B. Conway–MaxwellPoisson regression models for dispersed count data. WIREs Computational Statistics. 2020;13(6):e1533.
4.
Sellers KF, Shmueli G. A flexible regression model for count data. The Annals of Applied Statistics. 2010;4(2):943–61, 19.
5.
6.
Fung T, Alwan A, Wishart J, Huang A. Mpcmp: Mean-Parametrizied Conway-Maxwell Poisson Regression. 2020.
7.
Sim J, Wright CC. The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements. Physical Therapy. 2005 Mar;85(3):257–68.
8.
Jaksa A, Gibbs L, Kent S, Rowark S, Duffield S, Sharma M, et al. Using primary care data to assess comparative effectiveness and safety of apixaban and rivaroxaban in patients with nonvalvular atrial fibrillation in the UK: An observational cohort study. BMJ Open [Internet]. 2022/10/18 ed. 2022 Oct;12(10):e064662. Available from: https://www.ncbi.nlm.nih.gov/pubmed/36253039
9.
Lu AD, Gunzburger E, Glorioso TJ, Smith WB 2nd, Kenney RR, Whooley MA, et al. Impact of Longitudinal Virtual Primary Care on Diabetes Quality of Care. J Gen Intern Med [Internet]. 2021/01/24 ed. 2021 Sep;36(9):2585–92. Available from: https://www.ncbi.nlm.nih.gov/pubmed/33483815